IDF term weighting and IR research lessons
Abstract
Robertson comments on the theoretical status of IDF term weighting. Its history illustrates how ideas develop in a specific research context, through theory/experiment interaction, and in operational practice.

It is an honour to have the small proposal for term weighting that I published more than thirty years ago (Sparck Jones 1972) taken as the subject of Stephen Robertson's paper (Robertson 2004). I would like to comment on some points that I see as suggesting lessons for information retrieval research.

First, the context that prompted the proposal. The proposal came from trying to explain why earlier ideas about how to do automatic indexing did not work. They were plausible in themselves, but had quite different objectives. My previous research had concentrated on automatic methods for constructing term classifications intended, by analogy with manual thesauri, as recall-promoting devices. Classes were based on term co-occurrences in documents, following the generic statistical approach to retrieval initially suggested by Luhn, and applied within the coordination-level matching framework. But these classifications, on Cleverdon's Cranfield data and using the test and evaluation methods that he and Salton had established, did not deliver the predicted improvements in retrieval performance. The best performance was obtained with (necessarily) small groups of very similar terms.

Trying to understand what was happening in detail showed that terms occurring in many documents dominated the classes. Thus anything that increased their matching potential, as term substitution did, would inevitably retrieve rather more non-relevant than relevant documents. However, these frequent terms were also common in requests, and simply removing them, as advocated by Svenonius, could have a damaging effect on performance. The natural implication was therefore that less frequent terms should be grouped but more frequent ones should be confined to singleton classes.
This could give better performance than terms alone, but not for all test collections. What all this suggested was that it might be more profitable to concentrate on the frequency behaviour of terms and forget about classes. More specifically, it led to the idea that all terms should be allowed to match, but that the value of matches on frequent terms should be lower than that for non-frequent terms. Roger Needham, who had earlier worked on statistically-based methods of indexing and retrieval, was, as a mathematician, easily able to suggest an appropriate, simple formula that smoothly damped down frequency and was shown to work reliably and usefully for different collections.

My proposal for weighting was thus a direct response to the results of the kind of systematic retrieval testing that Cleverdon did so much to establish. It was also, when compared with Salton's work on automatic indexing, a product of subtly different data conditions. Salton
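The idea described above — let every term match, but damp the match value of terms that occur in many documents — is what became inverse document frequency (IDF) weighting. As a minimal illustrative sketch (the function names are mine, and this uses the simplest log(N/n) variant rather than necessarily the exact formula from the 1972 paper), the scheme can be written as:

```python
import math

def idf_weights(doc_term_lists):
    """Compute inverse-document-frequency weights for a small collection.

    doc_term_lists: list of documents, each a list of index terms.
    Returns a dict mapping term -> idf weight.
    """
    n_docs = len(doc_term_lists)
    df = {}  # document frequency: in how many documents each term occurs
    for terms in doc_term_lists:
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    # Frequent terms (high df) get low weight; rare terms get high weight.
    return {t: math.log(n_docs / n) for t, n in df.items()}

def match_score(query_terms, doc_terms, idf):
    """Score a document by summing the weights of the query terms it contains."""
    return sum(idf.get(t, 0.0) for t in set(query_terms) & set(doc_terms))
```

Note that a term occurring in every document gets weight log(N/N) = 0: it still matches, but contributes nothing to the score — the smooth damping of frequent terms rather than their outright removal.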
Similar resources
Inverse-Category-Frequency based Supervised Term Weighting Schemes for Text Categorization
Term weighting schemes often dominate the performance of many classifiers, such as kNN, centroid-based classifiers and SVMs. The term weighting scheme most widely used in text categorization, i.e., tf.idf, originated in the information retrieval (IR) field. The intuition behind idf seems less reasonable for text categorization than for IR. In this paper, we introduce inverse category frequency (icf) int...
Inverse Category Frequency based supervised term weighting scheme for text categorization
Term weighting schemes often dominate the performance of many classifiers, such as kNN, centroid-based classifiers and SVMs. The term weighting scheme most widely used in text categorization, i.e., tf.idf, originated in the information retrieval (IR) field. The intuition behind idf seems less reasonable for text categorization than for IR. In this paper, we introduce inverse category frequency (icf) int...
Web Information Retrieval using WordNet
Information retrieval (IR) is the area of study concerned with searching for documents or for information within documents. The user describes an information need with a query consisting of a number of words. Finding the weight of a query term is useful for determining the importance of the query. Calculating term importance is a fundamental aspect of most information retrieval approaches, and it is traditionall...
A Novel Term Weighting Scheme for a Fuzzy Logic Based Intelligent Web Agent
Term Weighting (TW) is one of the most important tasks in Information Retrieval (IR). To solve the TW problem, many authors have considered the Vector Space Model and, specifically, have used the TF-IDF method. As this method does not take into account some features of terms, we propose a novel, alternative fuzzy-logic-based method for TW in IR. TW is an essential task for the Web Intel...
Weighting in Information Retrieval Using Genetic Programming: A Three Stage Process
This paper presents term-weighting schemes that have been evolved using genetic programming in an ad hoc Information Retrieval model. We create an entire term-weighting scheme by first assuming that term-weighting schemes contain a global part, a term-frequency influence part and a normalisation part. By separating the problem into three distinct phases we reduce the search space and ease the ...
Journal: Journal of Documentation
Volume: 60, Issue: -
Pages: -
Published: 2004